Goto

Collaborating Authors

 Waterville


Adapting Multilingual Models to Code-Mixed Tasks via Model Merging

Kodali, Prashant, Shivkumar, Vaishnavi, Joshi, Swarang, Choudhary, Monojit, Kumaraguru, Ponnurangam, Shrivastava, Manish

arXiv.org Artificial Intelligence

We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach for sentence classification (sentiment and hate speech) task in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT->FT. We observe gains of 2--5 points in F1 over full fine-tuning and ~1-2 points over CPT->FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled+unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.


EDUMATH: Generating Standards-aligned Educational Math Word Problems

Christ, Bryan R., Molitz, Penelope, Kropko, Jonathan, Hartvigsen, Thomas

arXiv.org Artificial Intelligence

Math word problems (MWPs) are critical K-12 educational tools, and customizing them to students' interests and ability levels can increase learning outcomes. However, teachers struggle to find time to customize MWPs for each student given large class sizes and increasing burnout. We propose that LLMs can support math education by generating MWPs customized to student interests and math education standards. To this end, we use a joint human expert-LLM judge approach to evaluate over 11,000 MWPs generated by open and closed LLMs and develop the first teacher-annotated dataset for standards-aligned educational MWP generation. We show the value of our data by using it to train a 12B open model that matches the performance of larger and more capable open models. We also use our teacher-annotated data to train a text classifier that enables a 30B open LLM to outperform existing closed baselines without any training. Next, we show our models' MWPs are more similar to human-written MWPs than those from existing models. We conclude by conducting the first study of customized LLM-generated MWPs with grade school students, finding they perform similarly on our models' MWPs relative to human-written MWPs but consistently prefer our customized MWPs.


What Has Been Lost with Synthetic Evaluation?

Gill, Alexander, Ravichander, Abhilasha, Marasović, Ana

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.


Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research

Petersen, Molly R, Stevenson, Claire E, van der Plas, Lonneke

arXiv.org Artificial Intelligence

Analogical reasoning is an essential aspect of human cognition. In this paper, we summarize key theory about the processes underlying analogical reasoning from the cognitive science literature and relate it to current research in natural language processing. While these processes can be easily linked to concepts in NLP, they are generally not viewed through a cognitive lens. Furthermore, we show how these notions are relevant for several major challenges in NLP research, not directly related to analogy solving. This may guide researchers to better optimize relational understanding in text, as opposed to relying heavily on entity-level similarity.


The QCET Taxonomy of Standard Quality Criterion Names and Definitions for the Evaluation of NLP Systems

Belz, Anya, Mille, Simon, Thomson, Craig

arXiv.org Artificial Intelligence

Prior work has shown that two NLP evaluation experiments that report results for the same quality criterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality, and the comparability implied by the name can be misleading. Not knowing when two evaluations are comparable in this sense means we currently lack the ability to draw reliable conclusions about system quality on the basis of multiple, independently conducted evaluations. This in turn hampers the ability of the field to progress scientifically as a whole, a pervasive issue in NLP since its beginning (Sparck Jones, 1981). It is hard to see how the issue of unclear comparability can be fully addressed other than by the creation of a standard set of quality criterion names and definitions that the several hundred quality criterion names actually in use in the field can be mapped to, and grounded in. Taking a strictly descriptive approach, the QCET Quality Criteria for Evaluation Taxonomy derives a standard set of quality criterion names and definitions from three surveys of evaluations reported in NLP, and structures them into a hierarchy where each parent node captures common aspects of its child nodes. We present QCET and the resources it consists of, and discuss its three main uses in (i) establishing comparability of existing evaluations, (ii) guiding the design of new evaluations, and (iii) assessing regulatory compliance.


Findings of the Third Automatic Minuting (AutoMin) Challenge

Shinde, Kartik, Besacier, Laurent, Bojar, Ondrej, Thonet, Thibaut, Ghosal, Tirthankar

arXiv.org Artificial Intelligence

This paper presents the third edition of AutoMin, a shared task on automatic meeting summarization into minutes. In 2025, AutoMin featured the main task of minuting, the creation of structured meeting minutes, as well as a new task: question answering (QA) based on meeting transcripts. The minuting task covered two languages, English and Czech, and two domains: project meetings and European Parliament sessions. The QA task focused solely on project meetings and was available in two settings: monolingual QA in English, and cross-lingual QA, where questions were asked and answered in Czech based on English meetings. Participation in 2025 was more limited compared to previous years, with only one team joining the minuting task and two teams participating in QA. However, as organizers, we included multiple baseline systems to enable a comprehensive evaluation of current (2025) large language models (LLMs) on both tasks.


NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization

Kim, Hyuntak, Kim, Byung-Hak

arXiv.org Artificial Intelligence

Summarizing long-form narratives--such as books, movies, and TV scripts--requires capturing intricate plotlines, character interactions, and thematic coherence, a task that remains challenging for existing LLMs. We introduce NexusSum, a multi-agent LLM framework for narrative summarization that processes long-form text through a structured, sequential pipeline--without requiring fine-tuning. Our approach introduces two key innovations: (1) Dialogue-to-Description Transformation: A narrative-specific preprocessing method that standardizes character dialogue and descriptive text into a unified format, improving coherence. (2) Hierarchical Multi-LLM Summarization: A structured summarization pipeline that optimizes chunk processing and controls output length for accurate, high-quality summaries. Our method establishes a new state-of-the-art in narrative summarization, achieving up to a 30.0% improvement in BERTScore (F1) across books, movies, and TV scripts. These results demonstrate the effectiveness of multi-agent LLMs in handling long-form content, offering a scalable approach for structured summarization in diverse storytelling domains.


Data extraction and processing methods to aid the study of driving behaviors at intersections in naturalistic driving

Pundlik, Shrinivas, Choe, Seonggyu, Baker, Patrick, Lee, Chen-Yuan, Al-Madi, Naser, Bowers, Alex R., Luo, Gang

arXiv.org Artificial Intelligence

Naturalistic driving studies use devices in participants' own vehicles to record daily driving over many months. Due to diverse and extensive amounts of data recorded, automated processing is necessary. This report describes methods to extract and characterize driver head scans at intersections from data collected from an in-car recording system that logged vehicle speed, GPS location, scene videos, and cabin videos. Custom tools were developed to mark the intersections, synchronize location and video data, and clip the cabin and scene videos for +/-100 meters from the intersection location. A custom-developed head pose detection AI model for wide angle head turns was run on the cabin videos to estimate the driver head pose, from which head scans >20 deg were computed in the horizontal direction. The scene videos were processed using a YOLO object detection model to detect traffic lights, stop signs, pedestrians, and other vehicles on the road. Turning maneuvers were independently detected using vehicle self-motion patterns. Stop lines on the road surface were detected using changing intensity patterns over time as the vehicle moved. The information obtained from processing the scene videos, along with the speed data was used in a rule-based algorithm to infer the intersection type, maneuver, and bounds. We processed 190 intersections from 3 vehicles driven in cities and suburban areas from Massachusetts and California. The automated video processing algorithm correctly detected intersection signage and maneuvers in 100% and 94% of instances, respectively. The median [IQR] error in detecting vehicle entry into the intersection was 1.1[0.4-4.9] meters and 0.2[0.1-0.54] seconds. The median overlap between ground truth and estimated intersection bounds was 0.88[0.82-0.93].


The Balancing Act of Policies in Developing Machine Learning Explanations

Tjaden, Jacob

arXiv.org Artificial Intelligence

Machine learning models are often criticized as opaque from a lack of transparency in their decision-making process. This study examines how policy design impacts the quality of explanations in ML models. We conducted a classroom experiment with 124 participants and analyzed the effects of policy length and purpose on developer compliance with policy requirements. Our results indicate that while policy length affects engagement with some requirements, policy purpose has no effect, and explanation quality is generally poor. These findings highlight the challenge of effective policy development and the importance of addressing diverse stakeholder perspectives within explanations.


Computational Thinking with Computer Vision: Developing AI Competency in an Introductory Computer Science Course

Chowdhury, Tahiya

arXiv.org Artificial Intelligence

Developing competency in artificial intelligence is becoming increasingly crucial for computer science (CS) students at all levels of the CS curriculum. However, most previous research focuses on advanced CS courses, as traditional introductory courses provide limited opportunities to develop AI skills and knowledge. This paper introduces an introductory CS course where students learn computational thinking through computer vision, a sub-field of AI, as an application context. The course aims to achieve computational thinking outcomes alongside critical thinking outcomes that expose students to AI approaches and their societal implications. Through experiential activities such as individual projects and reading discussions, our course seeks to balance technical learning and critical thinking goals. Our evaluation, based on pre-and post-course surveys, shows an improved sense of belonging, self-efficacy, and AI ethics awareness among students. The results suggest that an AI-focused context can enhance participation and employability, student-selected projects support self-efficacy, and ethically grounded AI instruction can be effective for interdisciplinary audiences. Students' discussions on reading assignments demonstrated deep engagement with the complex challenges in today's AI landscape. Finally, we share insights on scaling such courses for larger cohorts and improving the learning experience for introductory CS students.